Introduction

My name is Sokona Mangane and I’m from Brooklyn, NY. I’m a senior at Bates College, majoring in Mathematics and minoring in Digital and Computational Studies. In conjunction with the Institute for a Racially Just, Inclusive, and Open STEM Education (RIOS), I am conducting a computational text analysis of STEM Open Education Resources (OER). In particular, I’m analyzing the “Inclusive Teaching” section descriptions on CourseSource, an open-access, peer-reviewed journal that publishes lessons, teaching content, and resources related to biology and physics. In the words of Dr. Carrie Diaz Eaton (an Associate Professor of Digital and Computational Studies at Bates College and a co-founder of QUBES, known for her work in social justice in STEM higher education), it’s like a GitHub, but for curriculum.

When publishing an article on CourseSource (articles are categorized as a “Lesson”, “Science Behind the Lesson”, “Teaching Tools and Strategies”, “Essay”, or “Review”), authors can describe how the article is inclusive under the “Inclusive Teaching” section; however, there are currently no guidelines for it. Thus, this text analysis of OER submissions serves to answer what people write about under Inclusive Teaching.

Setup/Data Cleaning

Here, I’ve imported the data set and the packages needed for analysis. I also did some data cleaning, created a vector of DEI-related words, and added some variables to the original data set.

#cmd + shift + c to comment out code 
#cmd + shift + M to print %>% pipe operator
#cmd + return to run code 
# install.packages("varhandle")
# install.packages("skimr")
# install.packages("tidyverse")
# install.packages("tidytext")
# install.packages("stopwords")
# install.packages("wordcloud")
# install.packages("reshape2")
# install.packages("ggraph")
# install.packages("kableExtra")

#loading necessary packages
library(varhandle)
library(ggraph)
library(igraph)
library(skimr)
library(tidyverse)
library(tidytext)
library(ggplot2)
library(readr)
library(stopwords)
library(wordcloud)
library(reshape2)
library(kableExtra)
library(SnowballC)

#importing dataset and DEI words list
rios_data <- read_csv("RIOS Research - Course Source - Sheet1 2.csv")
dei_keywords <- read_csv("SJEDI_words 2022-12-20 18_03_42.csv")

#correcting a data-entry error for one article
rios_data$`Inclusive Teaching  included?`[12] = "No"

#arranging years
rios_data <- rios_data %>%
  arrange(desc(Year))

# creating a new column article number, to number each article (most recent article: 286)
rios_data$article_num <- c(nrow(rios_data):1)
# I've created a variable which contains diversity related words (words pulled from the keywords column) and then combined it with the `dei_keywords` dataframe I imported (Thank you Dr. Diaz-Eaton). I also added another column, which includes the article number for each row.

# diversity_related <- c("diversity", "bias", "confirmation bias", "cognitive bias", "social justice", "broader impacts", "racism", "identity", "equity", "inclusivity", "environmental justice", "inclusion", "belonging")
# 
# #adding the vector above to the CSV dei_keywords
# for (x in 1:13){
#   dei_keywords[nrow(dei_keywords) + 1,] = diversity_related[x]
# }

Here, each word from the Inclusive Teaching description and keyword themes is “un-nested” into its own row, and any unnecessary punctuation, numbers, and words are removed. I saved each of these in its own new dataframe for analysis.

rios_data_tokenizedit <- rios_data %>%
  unnest_tokens(output = inclusive_teach_tokens, input = `Inclusive Teaching Description`)


#removing all rows with any punctuation, digits, or "stopwords" (~20k rows total)
strings <- c("[:punct:]", "[:digit:]","\\(","\\)")
stopwords_vec <- stopwords(language = "en")
stopwords_vec <- stopwords_vec[-c(165:167)]

#removed ~777 rows
rios_data_tokenizedit <- rios_data_tokenizedit %>%
  filter(!str_detect(inclusive_teach_tokens, paste(strings, collapse = "|")))

#removed ~19,663 rows
rios_data_tokenizedit <- rios_data_tokenizedit %>%
  filter(!inclusive_teach_tokens %in% stopwords_vec) 
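As a minimal illustration of the two filters above, here is the same cleaning rule in base R on toy tokens (hypothetical words, not the actual RIOS data):

```r
# toy tokens (hypothetical) run through the same two filters as above:
# 1) drop tokens containing punctuation or digits, 2) drop stopwords
tokens <- c("students", "it's", "2019", "the", "inclusive")
stops  <- c("the", "a", "an", "it's")

keep <- !grepl("[[:punct:][:digit:]]", tokens) & !(tokens %in% stops)
tokens[keep]
# [1] "students"  "inclusive"
```

A token is kept only if it passes both checks, mirroring the two filter() calls above.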

#doing same thing as above but for keyword themes
# rios_data_tokenizedkt <- rios_data %>%
#   unnest_tokens(output = keyword_themes_tokens, input = `keyword themes`)


#removing all rows with any punctuation, digits, or "stopwords" (78 rows total)
# strings <- c("[:punct:]", "[:digit:]","\\(","\\)")
# stopwords_vec <- stopwords(language = "en")

#removed ~777 rows
# rios_data_tokenizedkt <- rios_data_tokenizedkt %>%
#   filter(!str_detect(keyword_themes_tokens, paste(strings, collapse = "|")))

#removed ~19,663 rows
# rios_data_tokenizedkt <- rios_data_tokenizedkt %>%
#   filter(!keyword_themes_tokens %in% stopwords_vec) 

Based on the work above, I exported the data frame of all the distinct ‘cleaned’ words in the Inclusive Teaching section to a CSV and manually verified whether each of the 4,464 words should be counted as JEDI (looking at the context of the words where needed). After manual verification, I imported it back into R.

#allwords <- unique(rios_data_tokenizedit$inclusive_teach_tokens)
#uniquedeirelated <- sapply(allwords, function(x) any(sapply(dei_keywords, str_detect, string = x)))

#uniquedei <- cbind(allwords,uniquedeirelated)

#write_csv(as.data.frame(uniquedeirelated), "DEIRelated.csv")


#importing manually verified list of JEDI words 
JEDI_keywords_df <- read_csv("cleanedITwords - cleanedITwords.csv")

JEDI_keywords <- JEDI_keywords_df %>% 
  filter(Carrie == "JEDI") %>% 
  select(1)

dei_related columns were created for each data frame; the column is TRUE if that word (from the Inclusive Teaching descriptions or the keyword themes) matches any word from the JEDI_keywords dataframe.

#creating a DEI related column
rios_data_tokenizedit$dei_relatedit = NA

# rios_data_tokenizedkt$dei_relatedkt = NA

rios_data_tokenizedit$dei_relatedit <- sapply(rios_data_tokenizedit$inclusive_teach_tokens, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))

# rios_data_tokenizedkt$dei_relatedkt <- sapply(rios_data_tokenizedkt$keyword_themes_tokens, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))

#save(dei_keywords, file = "dei_keywords.csv")
#saveRDS(rios_data_tokenized, file = "rios_data_tokenized.rds")

#removing the unnecessary columns
rios_data_tokenizedit <- rios_data_tokenizedit[,-c(9:13)]
# rios_data_tokenizedkt <- rios_data_tokenizedkt[,-c(9:13)]
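One caveat worth noting: str_detect()-style matching flags a token whenever any keyword occurs in it as a substring, which can produce false positives; this is part of why the manual verification step was needed. A minimal base-R sketch (hypothetical tokens and keywords) shows how it differs from exact matching:

```r
tokens   <- c("inclusive", "inclusively", "race", "bracelet", "lecture")
keywords <- c("inclusive", "race")

# substring matching (what the str_detect() approach above does):
# "bracelet" is flagged because it contains "race"
substring_match <- sapply(tokens, function(x) any(sapply(keywords, grepl, x = x)))

# exact matching: only whole-token matches count
exact_match <- tokens %in% keywords

unname(substring_match)
# [1]  TRUE  TRUE  TRUE  TRUE FALSE
exact_match
# [1]  TRUE FALSE  TRUE FALSE FALSE
```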

Exploratory Data Analysis: Word Count

Word Count of Inclusive Teaching Text Over Time Box Plot

The boxplot below visualizes the word count of the Inclusive Teaching section over time. The word count increases in 2019 compared to the years prior, and we also start to see more outliers. Overall, since the creation of CourseSource, the word count of the Inclusive Teaching section has increased.

#factoring years so I can reorder from oldest to most recent
rios_data$Year <- factor(rios_data$Year, levels = c("2022", "2021", "2020", "2019", "2018", "2017", "2016", "2015", "2014"))

#coloring groups (color blind safe colors) based on the shift (2014-2018 and 2019-2022)

  boxplot(`Word Count of Inclusive Teaching?`~ Year,
          data=rios_data,
          main="Word Count of Inclusive Teaching Sections Over Time",
          ylab="Year",
          xlab="Word count of Inclusive Teaching Section",
          horizontal = TRUE, 
          col = c("#1f78b4", "#1f78b4", "#1f78b4", "#005AB5", "#DC3220","#DC3220", "#DC3220", "#DC3220", "#DC3220"))
  abline(v = 118.4868, col = "#DC3220", lty = "solid", lwd = 3)
  abline(v = 216.0667, col = "#005AB5",lty = "solid", lwd = 3)
  legend("topright", inset=.02, title="Average Word Count",
   c("for 2014 - 2018","for 2019 - 2022"), fill=c("#DC3220", "#005AB5"), horiz=TRUE, cex=0.8)

rios_data$Year <-  unfactor(rios_data$Year)



Presented below is an in-depth look at what’s visualized above.

rios_data %>% 
  group_by(Year) %>% 
  skim(starts_with("Word Count")) %>% 
  select(3,4,6:13)  %>% 
  mutate(numeric.mean = round(numeric.mean, digits = 2), numeric.sd = round(numeric.sd, digits = 2), variance = (numeric.sd)^2) %>% 
  rename("Mean" = "numeric.mean",
         "SD" = "numeric.sd",
         "Variance" = "variance",
         "Min" = "numeric.p0",
         "25 Q" = "numeric.p25",
         "Median" = "numeric.p50",
         "75 Q" = "numeric.p75",
         "Max" = "numeric.p100",
         "Histogram" = "numeric.hist") %>% 
  kable() %>% 
  kable_minimal()
Year n_missing Mean SD Min 25 Q Median 75 Q Max Histogram Variance
2014 4 106.85 58.05 34 63.00 90.0 133.00 230 ▇▇▆▁▃ 3369.802
2015 4 122.57 61.70 43 70.50 116.0 174.25 228 ▇▅▃▂▅ 3806.890
2016 3 115.80 89.83 26 79.00 103.0 127.50 453 ▇▅▁▁▁ 8069.429
2017 2 123.00 56.55 37 83.50 107.0 154.00 238 ▃▇▃▃▂ 3197.903
2018 5 124.70 80.22 34 89.75 95.0 144.75 324 ▆▇▃▁▂ 6435.248
2019 4 173.33 114.77 25 98.75 141.5 213.75 483 ▆▇▃▁▂ 13172.153
2020 2 249.45 218.46 43 126.50 203.0 276.00 1415 ▇▂▁▁▁ 47724.772
2021 5 224.62 170.73 43 125.75 169.0 241.75 901 ▇▃▁▁▁ 29148.733
2022 1 210.74 118.28 41 124.00 183.0 249.50 565 ▅▇▂▁▁ 13990.158
#CREATING A BAR PLOT OF THE AVERAGE WORD COUNT OF THE TWO GROUPS

rios_data <- rios_data %>% 
  mutate(`Group Year` = case_when(
    as.numeric(Year) >= 5 ~ "2014 - 2018",
    as.numeric(Year) < 5 ~ "2019 - 2022"
  )) #creating column that groups based on the shift (2014-2018 and 2019-2022), used 5 because it's factored and 2018 = level 5

# test <- t.test(formula = `Word Count of Inclusive Teaching?` ~ `Group Year`,
#          data = rios_data)
# 
# test #p-value = 1.557e-10, confidence interval:  -126.37537  -68.78427, df = 254?!

rios_data %>%  
  group_by(`Group Year`) %>% 
  summarise(Avg_Wrd_Ct = mean(`Word Count of Inclusive Teaching?`, na.rm = TRUE), #calculating avg wrd count for each group
            n = n(),
            sd = sd(`Word Count of Inclusive Teaching?`, na.rm = TRUE)) %>% #calculating standard deviation for each group (for confidence intervals)
  mutate(se = sd/sqrt(n), #calculating the standard error
         ic = se * qt((1-0.05)/2 + 0.5, n-1)) %>% #margin of error: se * the 97.5th percentile of the t-distribution (for a 95% CI)
  ggplot(aes(`Group Year`, Avg_Wrd_Ct, fill = `Group Year`)) +
  geom_col() + 
  labs(title = "Average Word Count of Inclusive Teaching Section", subtitle = "Before and After 2018", x = "Year", y = "Average Word Count") + 
  scale_fill_manual(values = c("#005AB5", "#DC3220")) + #manually coloring groups (color blind safe colors) 
  geom_errorbar(aes(x = `Group Year`, ymin = Avg_Wrd_Ct - ic, ymax = Avg_Wrd_Ct + ic), width = 0.4)
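The error-bar half-width ic above is the usual 95% confidence-interval margin for a mean. A minimal base-R check on toy numbers (hypothetical counts, not the RIOS data):

```r
x  <- c(34, 63, 90, 133, 230)                    # toy word counts
n  <- length(x)
se <- sd(x) / sqrt(n)                            # standard error of the mean
ic <- se * qt((1 - 0.05) / 2 + 0.5, df = n - 1)  # margin of error
# qt((1 - 0.05)/2 + 0.5, n - 1) is just qt(0.975, n - 1), the two-sided
# 97.5th percentile of the t distribution
c(lower = mean(x) - ic, upper = mean(x) + ic)
```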

Exploratory Data Analysis: Word Frequency

What are the most common “DEI” Words in the Inclusive Teaching Description?

Only 2.6% of the words in the Inclusive Teaching text are DEI related (118/4,464). Looking at the most common DEI words gives us an idea of which DEI words are used the most and what that tells us about how the authors are being inclusive. According to the table below, the words “inclusive”, “diversity”, and “diverse” are the most common “DEI” words. This makes sense, as inclusive teaching should be diverse and cater to a diversity of racial backgrounds. Of the 118 “DEI” words used in the Inclusive Teaching text, note that 70% are repeated more than once (83/118) and 54.2% are repeated more than twice (64/118). Based on these common words, it seems these articles try to be inclusive by being diverse, engaging, and catering to a diverse set of backgrounds and abilities.

However, the section these descriptions fall under is titled “Inclusive Teaching”, so one could write a lengthy description without using any of the words from dei_keywords and still mention “inclusive teaching” to be included in this category; thus, these numbers may be an underestimate. Below (in the 2-gram analysis) you can find a data frame, rios_2w_count, that shows the most common DEI two-word phrases.

#BSS = BASIC SUMMARY STATISTICS

#most common DEI words, out of 118 (out of 4,464 words, 2.6% are DEI)
rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  count(inclusive_teach_tokens, sort = TRUE)  
## # A tibble: 785 × 2
##    inclusive_teach_tokens     n
##    <chr>                  <int>
##  1 students                1389
##  2 inclusive                123
##  3 diversity                115
##  4 diverse                  100
##  5 opportunity               94
##  6 individual                72
##  7 engage                    66
##  8 environment               64
##  9 backgrounds               57
## 10 participate               57
## # ℹ 775 more rows
# #creating stemmed DEI word list
# DEI_related <- rios_data_tokenizedit %>% 
#   filter(dei_relatedit == "TRUE") %>% 
#   select(10) #CREATING DF OF DEI INCLUSIVE TEACH TOKENS

#creating column of stemmed tokens
rios_data_tokenizedit <- rios_data_tokenizedit %>% 
  mutate(stem = wordStem(rios_data_tokenizedit$inclusive_teach_tokens, language = "en")) %>% 
  rename("inclusive_tokens_stem" = "stem")

#creating stemmed JEDI list
stem_jedi <- JEDI_keywords %>% 
  mutate(stem = wordStem(`allwords`, language = "en")) %>% 
  count(stem, sort = TRUE)

#stem_token_it$dei_related <- sapply(stem_token_it$stem, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))

rios_data_tokenizedit %>% 
  filter(dei_relatedit == "TRUE") %>% 
  count(inclusive_tokens_stem, sort = TRUE)
## # A tibble: 497 × 2
##    inclusive_tokens_stem     n
##    <chr>                 <int>
##  1 student                1389
##  2 divers                  215
##  3 inclus                  181
##  4 particip                167
##  5 engag                   165
##  6 opportun                143
##  7 encourag                142
##  8 individu                133
##  9 addit                    87
## 10 access                   85
## # ℹ 487 more rows

Word Cloud

Word clouds are another way of visualizing which words are being used the most. This word cloud shows the distinct words printed in the table above.

#ORIGINAL

rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  count(inclusive_teach_tokens, sort = TRUE) 
## # A tibble: 785 × 2
##    inclusive_teach_tokens     n
##    <chr>                  <int>
##  1 students                1389
##  2 inclusive                123
##  3 diversity                115
##  4 diverse                  100
##  5 opportunity               94
##  6 individual                72
##  7 engage                    66
##  8 environment               64
##  9 backgrounds               57
## 10 participate               57
## # ℹ 775 more rows
#STEMMED AND LOGGED WORD CLOUD
rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  count(inclusive_tokens_stem, sort = TRUE) %>% 
  mutate(log_n = log(n)) %>% #since there are so many words and the counts are heavily skewed, a log-scaled count is computed
  with(wordcloud(inclusive_tokens_stem, n, min.freq = 2)) #words w/ frequency below 2 won't be plotted; this eliminates about 33.2% of the data

The tables and visuals above give us an idea of how often DEI words are used and what that says about the inclusivity of the articles. However, looking at the most commonly used DEI words doesn’t give us all the information on how an article is being inclusive or the authors’ definitions of inclusivity. Thus, I’ll repeat the analyses above, but looking at phrases, specifically two-word phrases. Unlike above, I looked through all the phrases and removed those that I felt were unnecessary and/or didn’t make sense. Below are the most common DEI phrases. Although some of the phrases aren’t repeated often, they show that the definition of inclusive teaching goes beyond just engaging all students.

Common DEI Phrases

2 words

rios_data_token2it <- rios_data %>%
  unnest_tokens(it_tokens_2w, `Inclusive Teaching Description`, token = "ngrams", n = 2)  %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_data_token2it <- rios_data_token2it %>%
  filter(!word1 %in% stopwords_vec) %>%
  filter(!word2 %in% stopwords_vec) %>% 
  unite(it_tokens_2w, word1, word2, sep = " ") 

rios_data_token2it <- rios_data_token2it %>%
  filter(!str_detect(it_tokens_2w, paste(strings, collapse = "|")))
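For intuition, the bigram tokenization done by unnest_tokens(..., token = "ngrams", n = 2) can be sketched in base R on a toy string (hypothetical text):

```r
txt     <- "engages all students from diverse backgrounds"
words   <- strsplit(tolower(txt), "\\s+")[[1]]
# pair each word with its successor
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
# [1] "engages all"         "all students"        "students from"
# [4] "from diverse"        "diverse backgrounds"
```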

# code to print most common DEI 2word phrases for review

# #for review (Naz?)
# all2words <- as.data.frame(unique(rios_data_token2it$it_tokens_2w))
# all2words$dei_related = NA
# all2words$dei_related <- sapply(all2words$`unique(rios_data_token2it$it_tokens_2w)`, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
# write_csv(all2words, "2DEIRelated.csv")
# 
# 
# #most common DEI words
# rios_2w_count <- rios_data_token2it %>%
#   filter(dei_related == "TRUE") %>%
#   count(it_tokens_2w, sort = TRUE) 
# 
# write_csv(rios_2w_count, "rios2wcount.csv")

#creating a DEI related column
rios_data_token2it$dei_related = NA


#importing manually verified list of JEDI 2 word phrases 
JEDI_2keywords_df <- read_csv("cleanedIT2words.csv")

#filtering words that haven't been checked and aren't JEDI
JEDI_2keywords <- JEDI_2keywords_df %>% 
  filter(JEDI_2keywords_df$...5 != "unsure" | is.na(JEDI_2keywords_df$...5)) %>% 
  filter(...3 == "JEDI") %>% 
  select(1)
  


rios_data_token2it$dei_related <- sapply(rios_data_token2it$it_tokens_2w, function(x) any(sapply(JEDI_2keywords, str_detect, string = x)))


#most common DEI words
rios_2w_count <- rios_data_token2it %>%
  filter(dei_related == "TRUE") %>%
  count(it_tokens_2w, sort = TRUE)

#graph of that 
rios_2w_count %>%
  top_n(30) %>%
  mutate(it_tokens_2w = reorder(it_tokens_2w, n)) %>%
  ggplot(aes(it_tokens_2w, n)) +
  geom_col() +
  coord_flip() +
  labs(y = "(DEI Related) 2 Word Count in Inclusive Teaching Text") + 
  xlab(NULL)

3 words

I also did a 3-gram word analysis, which has much lower frequencies. However, this gives us a better idea of what “inclusive teaching” means in these contexts.

## plot the frequency
rios_data_token3it <- rios_data %>%
  unnest_tokens(it_tokens_3w, `Inclusive Teaching Description`, token = "ngrams", n = 3)  %>%
  separate(it_tokens_3w, c("word1", "word2", "word3"), sep = " ")

rios_data_token3it <- rios_data_token3it %>%
  filter(!word1 %in% stopwords_vec) %>%
  filter(!word2 %in% stopwords_vec) %>%
  filter(!word3 %in% stopwords_vec) %>%
  unite(it_tokens_3w, word1, word2, word3, sep = " ") 

rios_data_token3it <- rios_data_token3it %>%
  filter(!str_detect(it_tokens_3w, paste(strings, collapse = "|")))


# code to print most common DEI 3word phrases for review

# all3words <- as.data.frame(unique(rios_data_token3it$it_tokens_3w))
# all3words$dei_related = NA
# all3words$dei_related <- sapply(all3words$`unique(rios_data_token3it$it_tokens_3w)`, function(x) any(sapply(JEDI_2keywords, str_detect, string = x)))
# #write_csv(all2words, "2DEIRelated.csv")
# 
# 
# #most common DEI words
# all3words <- all3words %>%
#   filter(dei_related == "TRUE") %>%
#   count(`unique(rios_data_token3it$it_tokens_3w)`, sort = TRUE)
# 
# write_csv(all3words, "all3words.csv")



#creating a DEI related column
rios_data_token3it$dei_related = NA


rios_data_token3it$dei_related <- sapply(rios_data_token3it$it_tokens_3w, function(x) any(sapply(dei_keywords, str_detect, string = x)))


#removing the unnecessary columns
rios_data_token3it <- rios_data_token3it[,-c(9:13)]


#for review (Naz?)
# all3words <- as.data.frame(unique(rios_data_token3it$it_tokens_3w))
# all3words$dei_related = NA
# all3words$dei_related <- sapply(all3words$`unique(rios_data_token3it$it_tokens_3w)`, function(x) any(sapply(JEDI_keywords, str_detect, string = x)))
# write_csv(all3words, "3DEIRelated.csv")


#graph of the most common
rios_data_token3it %>%
  filter(dei_related == "TRUE") %>%
  count(it_tokens_3w, sort = TRUE) %>%
  top_n(30) %>%
  mutate(it_tokens_3w = reorder(it_tokens_3w, n)) %>%
  ggplot(aes(it_tokens_3w, n)) +
  geom_col() +
  coord_flip() +
  labs(y = "(DEI Related) 3 Word Count in Inclusive Teaching Text") + 
  xlab(NULL)

In Depth Bar Chart of Word Count By Year

This is a bar chart that looks more in depth at the frequency of words used each year. You can click on each tab to compare. Although 2014–2016 have higher proportions of DEI words, from 2018–2022 the top 20 DEI-related words are used more frequently; specifically, the words “diverse”, “diversity”, “engage”, “individual”, and “inclusive” are used more. We can see that the number and diversity of words increase each year (with decreases in 2018 and 2021).

# Below is just the code to print all 9 years into one table

#saving for visuals on word counts, etc
it_word_counts <- rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  group_by(Year) %>%
  count(inclusive_teach_tokens, sort = TRUE)

it_stem_word_counts <- rios_data_tokenizedit %>%
  filter(dei_relatedit == "TRUE") %>%
  group_by(Year) %>%
  count(inclusive_tokens_stem, sort = TRUE)

#original
it_word_counts %>%
  # filter(Year != 2018) %>%
  filter(n > 1) %>% #41.58 of data
  ggplot(aes(inclusive_teach_tokens, n)) +
  geom_col() +
 #geom_text(aes(label = inclusive_teach_tokens), vjust = -0.5, size = 1, nudge_y = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~Year, ncol = 2) +
  ylim(0,355) +
  labs(title = "(DEI Related) Word Frequency Over Time", x = "(DEI Related) Word", y = "Word Count in Inclusive Teaching Text") +
   ## reduce spacing between labels and bars
  scale_x_discrete(expand = c(.01, .01)) +
  scale_fill_identity(guide = "none") +
  ## get rid of all elements except y axis labels + adjust plot margin +
  theme(axis.text.y = element_text(size = 14, hjust = 1, family = "Fira Sans"),
        plot.margin = margin(rep(15, 4)))

it_stem_word_counts %>%
  # filter(Year != 2018) %>%
  filter(n > 1) %>% #41.58 of data
  ggplot(aes(inclusive_tokens_stem, n)) +
  geom_col() +
 #geom_text(aes(label = inclusive_teach_tokens), vjust = -0.5, size = 1, nudge_y = 1) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  facet_wrap(~Year, ncol = 2) +
  ylim(0,355) +
  labs(title = "(DEI Related) Stemmed Word Frequency Over Time", x = "(DEI Related) Word", y = "Word Count in Inclusive Teaching Text") +
   ## reduce spacing between labels and bars
  scale_x_discrete(expand = c(.01, .01)) +
  scale_fill_identity(guide = "none") +
  ## get rid of all elements except y axis labels + adjust plot margin +
  theme(axis.text.y = element_text(size = 7, hjust = 1, family = "Fira Sans"),
        plot.margin = margin(rep(15, 4)))

# it_word_counts %>%
#   ungroup(Year) %>%
#   count(n) %>%
#   mutate(percent = (nn/416)*100)

2014

# it_word_counts %>% 
#   filter(inclusive_teach_tokens != "students") %>% 
#   group_by(Year) %>% 
#   summarise(max = max(n))
# 
# it_stem_word_counts %>% 
#   filter(inclusive_tokens_stem != "student") %>% 
#   group_by(Year) %>% 
#   summarise(max = max(n))

it_word_counts %>%
  filter(Year == 2014) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

2015

it_word_counts %>%
  filter(Year == 2015) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

2016

it_word_counts %>%
  filter(Year == 2016) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

2017

it_word_counts %>%
  filter(Year == 2017) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

2018

it_word_counts %>%
  filter(Year == 2018) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

2019

it_word_counts %>%
  filter(Year == 2019) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

2020

it_word_counts %>%
  filter(Year == 2020) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

2021

it_word_counts %>%
  filter(Year == 2021) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

2022

it_word_counts %>%
  filter(Year == 2022) %>%
  head(20) %>%
  ggplot(aes(inclusive_teach_tokens, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,35)

In Depth Bar Chart of Stemmed Word Count By Year

2014

it_stem_word_counts %>%
  filter(Year == 2014) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

2015

it_stem_word_counts %>%
  filter(Year == 2015) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

2016

it_stem_word_counts %>%
  filter(Year == 2016) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

2017

it_stem_word_counts %>%
  filter(Year == 2017) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

2018

it_stem_word_counts %>%
  filter(Year == 2018) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

2019

it_stem_word_counts %>%
  filter(Year == 2019) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

2020

it_stem_word_counts %>%
  filter(Year == 2020) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

2021

it_stem_word_counts %>%
  filter(Year == 2021) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

2022

it_stem_word_counts %>%
  filter(Year == 2022) %>%
  head(20) %>%
  ggplot(aes(inclusive_tokens_stem, n)) + 
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  ylim(0,60)

Normalized Word Frequency: The Most Distinctive Words By Year

This can help us see the “weight” of each word and which words are most distinctive for each year. Here we’re printing the idf, which is the “inverse document frequency, which decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents. This can be combined with term frequency to calculate a term’s tf-idf [the term frequency and idf multiplied together], the frequency of a term adjusted for how rarely it is used. The statistic tf-idf is intended to measure how important a word is to a document in a collection (or corpus) of documents, for example, to one novel in a collection of novels or to one website in a collection of websites” (Silge & Robinson, Text Mining with R).

For comparison purposes, the y-axis has the same limits across all the graphs. We can see that the words “cultured” and “disengaged” are the most distinctive words for 2014 relative to the DEI-related words of the other years. These results make sense and align with the visuals above: aside from “diversity” and “diverse” (each seen more than 5 times in 2014), DEI words were used rarely in 2014, so the ones that do appear there but rarely in other years receive a high tf-idf statistic. These words have been used more each year.
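To make the statistic concrete, here is a by-hand tf-idf computation on a toy two-document corpus (hypothetical counts, treating each year as one document), matching the definition quoted above: tf_idf = tf * ln(number of documents / number of documents containing the term).

```r
toy <- data.frame(
  year = c(2014, 2014, 2015),
  word = c("diverse", "engage", "engage"),
  n    = c(5, 2, 3)
)

words_per_year <- tapply(toy$n, toy$year, sum)  # "document" lengths
toy$tf  <- as.numeric(toy$n / words_per_year[as.character(toy$year)])
docs_with_word <- tapply(toy$year, toy$word, function(y) length(unique(y)))
toy$idf <- as.numeric(log(length(unique(toy$year)) / docs_with_word[toy$word]))
toy$tf_idf <- toy$tf * toy$idf

# "diverse" appears only in 2014, so idf = ln(2/1) > 0 and it gets weight;
# "engage" appears in both years, so idf = ln(2/2) = 0 and its tf-idf vanishes
round(toy$tf_idf, 3)
# [1] 0.495 0.000 0.000
```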

2014

#finding the most distinctive words for each document
it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2014) %>%
  top_n(40) %>%
  mutate(inclusive_teach_tokens = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2014", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)

2015

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2015) %>%
  top_n(40) %>%
  mutate(inclusive_teach_tokens = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2015", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)

2016

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2016) %>%
  top_n(40) %>%
  mutate(inclusive_teach_tokens = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2016", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)

2017

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2017) %>%
  top_n(40) %>%
  mutate(inclusive_teach_tokens = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2017", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)

2018

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2018) %>%
  top_n(40) %>%
  mutate(inclusive_teach_tokens = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2018", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)

2019

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2019) %>%
  top_n(40) %>%
  mutate(inclusive_teach_tokens = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(inclusive_teach_tokens, tf_idf)) +
  geom_bar(stat = "identity", show.legend = FALSE) + 
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2019", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0,0.04)

2020

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2020) %>%
  top_n(40) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(it_tokens_3w, tf_idf)) +
  geom_col(show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2020", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0, 0.04)

2021

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2021) %>%
  top_n(40) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(it_tokens_3w, tf_idf)) +
  geom_col(show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2021", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0, 0.04)

2022

it_word_counts %>%
  bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
  arrange(desc(tf_idf)) %>%
  filter(Year == 2022) %>%
  top_n(40) %>%
  mutate(it_tokens_3w = reorder(inclusive_teach_tokens, tf_idf)) %>%
  ggplot(aes(it_tokens_3w, tf_idf)) +
  geom_col(show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
  labs(title = "(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in 2022", x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") + ylim(0, 0.04)
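The seven per-year chunks above differ only in the year and the plot title, so they can be collapsed into a single helper function. A minimal sketch, assuming (as in the chunks above) that `it_word_counts` has the columns `inclusive_teach_tokens`, `Year`, and `n`:

```r
# Helper: tf-idf bar chart of DEI-related words for a given year.
# Assumes it_word_counts has columns inclusive_teach_tokens, Year, n.
plot_tfidf_year <- function(year) {
  it_word_counts %>%
    bind_tf_idf(inclusive_teach_tokens, Year, n) %>%
    filter(Year == year) %>%
    top_n(40) %>%
    mutate(it_tokens_3w = reorder(inclusive_teach_tokens, tf_idf)) %>%
    ggplot(aes(it_tokens_3w, tf_idf)) +
    geom_col(show.legend = FALSE) +
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
    labs(title = paste0("(DEI Related) Normalized Word Frequency in Inclusive Teaching Text in ", year),
         x = "DEI Related Words", y = "Word Weight (tf-idf statistic)") +
    ylim(0, 0.04)
}

# e.g. plot_tfidf_year(2020)
```

With `purrr` (loaded as part of the tidyverse), `map(2016:2022, plot_tfidf_year)` would produce all seven plots in one call.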

Network Plot of Word Relationship over Time

To gain a deeper understanding of how inclusive teaching is viewed, we create a network plot of the relationships between words/phrases in the Inclusive Teaching section. The generated igraph object is called rios_phrase_network; it has 41 words and 36 connections among them. Consistent with several of the graphs above, the words “inclusive”, “students”, and “diverse” are connected to many other words.

2014

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2014) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2014")

2015

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2015) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2015")

2016

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2016) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2016")

2017

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2017) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2017")

2018

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2018) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2018")

2019

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2019) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2019")

2020

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2020) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2020")

2021

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2021) %>%
  count(word1, word2, sort = TRUE) %>%  
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2021")

2022

rios_data_token2 <- rios_data_token2it %>%
  separate(it_tokens_2w, c("word1", "word2"), sep = " ")

rios_phrase_network <- rios_data_token2 %>% 
  filter(dei_related == TRUE & Year == 2022) %>%
  count(word1, word2, sort = TRUE) %>% 
  graph_from_data_frame()


set.seed(20181005)

a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")

ggraph(rios_phrase_network, layout = "fr") + geom_edge_link(aes(color = n, width = n), arrow = a) + 
    geom_node_point() + geom_node_text(aes(label = name), vjust = 1, hjust = 1)  +
  labs(title = "Network Plot of (DEI Related) Word Relationship in 2022")
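As with the tf-idf charts, the per-year network chunks repeat the same pipeline with only the year changing. A minimal helper sketch, assuming (as above) that `rios_data_token2it` has the bigram column `it_tokens_2w`, the logical flag `dei_related`, and `Year`:

```r
# Helper: DEI-related bigram network plot for a given year.
# Assumes rios_data_token2it has columns it_tokens_2w, dei_related, Year.
plot_network_year <- function(year) {
  a <- arrow(angle = 30, length = unit(0.1, "inches"), ends = "last", type = "open")
  set.seed(20181005)  # fixed seed so the force-directed ("fr") layout is reproducible
  rios_data_token2it %>%
    separate(it_tokens_2w, c("word1", "word2"), sep = " ") %>%
    filter(dei_related == TRUE & Year == year) %>%
    count(word1, word2, sort = TRUE) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
    geom_edge_link(aes(color = n, width = n), arrow = a) +
    geom_node_point() +
    geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
    labs(title = paste0("Network Plot of (DEI Related) Word Relationship in ", year))
}

# e.g. plot_network_year(2019)
```

Because the bigram separation and arrow definition no longer change between years, each year's plot becomes a single call, e.g. `plot_network_year(2014)`.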

References

CourseSource. QUBES. (n.d.). Retrieved October 2022, from https://qubeshub.org/community/groups/coursesource/

Dewsbury, B., & Brame, C. J. (2019). Inclusive teaching. CBE—Life Sciences Education, 18(2), 1–5. https://doi.org/10.1187/cbe.19-01-0021